Introduction

Marketing analytics refers to the study of customer data to evaluate and devise marketing activities which has been widely incorporated by businesses across the globe especially in the telecommunications industry. In depth, analysis of the customer data was used to determine the main factors of consumer action, then enhance the company marketing strategies and maximize return on investments from the wonders of their marketing analytics (Cote, 2021).

In this project, a real corporate dataset of a pilot campaign launched by a telecommunication company was used, herein referred to as ZenTel (pseudonym due to confidentiality). ZenTel is one of the largest telecommunication companies in Malaysia which has been employing customer-centric solutions to facilitate seamless, consistent, and excellent customer experiences by delivering the best value for money offerings and rate plans to its customers.

Problem Statement

ZenTel had recently proposed to migrate their existing telco platform to a more advanced platform to improve user and customer experience. To optimise the platform migration, ZenTel had decided to come up with an initiative called “Right Planning”, which aimed to migrate all customers’ old rate plan to a newer rate plan at a lower cost with better benefits. With “Right Planning”, ZenTel was able to enlighten customer experience with new rate plans with better benefits and offers, at the same time, remove the outdated rates plans and standardise the rate plans information stored in the new platform.

To evaluate the efficacy of the proposed campaign, the Base Management Team at ZenTel conducted a trial run by selecting a group of customers before officially launching the “Right Planning” to their seven million customers.The aim goal of the study is to evaluate the pilot campaign effectiveness by introducing new rate plans that provide superior benefits and offers and understand customer behavior whom tends to take up the offer, while simultaneously eliminating obsolete rate plans and establishing uniformity in the information regarding rate plans stored in the new platform.

Research Questions

  1. EDA: How effective is the “Right Planning” pilot campaign in terms of customer participation?
  2. EDA: What are the demographic and behavioral characteristics of customers who are more likely to opt-in for the “Right Planning” pilot campaign?
  3. Classification: How can we accurately predict which customers will opt-in for the “Right Planning” campaign based on their usage and revenue behavior?
  4. Regression: How revenue generated before the “Right Planning” campaign correlates with the customers’ behavior in terms of usage?

Research Objectives

  1. To evaluate the effectiveness of the “Right Planning” pilot campaign.
  2. To identify the campaign offer takers’ profile.
  3. To develop a classification model which accurately predicts if the customers will opt-in for the “Right Planning” campaign based on usage and revenue behavior.
  4. To investigate the relationship between the customers’ behavior in terms of usage and revenue generated before the campaign using multiple linear regression.

Dataset Description

  • A set of real dataset provided by the Base Management Team in one of the largest telecommunication companies in Malaysia.
  • A total of 7272 records campaign data that launched between 5th October 2022 and 11th October 2022.
  • Target base is a group of active Prepaid Malaysian customers.
  • Dataset contains the target base’ demographic information, usage, and revenue activity before and after the campaign.

Methodology

This section displayed the steps from data understanding, data exploration, data cleaning, data-preprocessing and data transformation before conducting data analysis and predictive modelling.

Part 1: Import and Load Libraries

# Install required packages
install.packages(c("readxl", "caret", "pryr", "corrplot", "dplyr", "ggplot2", "coefplot", "infotheo", "lubridate", "randomForest", "Rcpp", "devtools", "tidyr", "reshape2", "cowplot","e1071"))
# Load installed packages
library(readxl)
library(knitr)
library(caret)
library(pryr)
library(corrplot)
library(dplyr)
library(ggplot2)
library(coefplot)
library(infotheo)
library(lubridate)
library(randomForest)
library(Rcpp)
library(devtools)
library(tidyr)
library(reshape2)
library(cowplot)
library(e1071)

Part 2: Metadata & Dataset Overview (Before Data Cleaning)

Table below shows the metadata of the dataset collected.

# Load metadata 
metadata <- read_excel("/Users/kharshin/Campaign_Dataset_Description.xlsx")
kable(metadata)
Variable Description  DataType Unit
ID Customer ID String N/A
TENURE Customer duration with Kation since registration date Integer Months
AGE Customer age Integer Years
GENDER Customer gender String N/A
NATIONALITY Customer nationality String N/A
STATE Customer hometown (state) String N/A
STATUS_BEFORE Customer status before campaign launched. String N/A
STATUS_AFTER Customer status after campaign ended. String N/A
OFFER_TAKER Indicator for customers who opted-in the migration plan. Boolean N/A
OFFER_TAKE_UP_DT Date for customers who opted-in the migration plan. Date N/A
DATA_PURC_BEFORE Indicator for customer who purchased data before campaign launched. Boolean N/A
DATA_PURC_AFTER Indicator for customer who purchased data after campaign ended. Boolean N/A
DATA_CHRG_BEFORE  Total amount of data charged before campaign launched. Float RM
DATA_CHRG_AFTER Total amount of data charged after campaign ended. Float RM
DATA_USG_BEFORE Data usage before campaign launched. Float MB
DATA_USG_AFTER Data usage after campaign ended. Float MB
VOICE_USG_BEFORE Voice usage before campaign launched. Float Minutes
VOICE_USG_AFTER Voice usage after campaign ended. Float Minutes
RLD_IND_BEFORE Indicator for customer who reload before campaign launched. Boolean N/A
RLD_IND_AFTER Indicator for customer who reload after campaign ended. Boolean N/A
RLD_AMT_BEFORE Total of reload amount before campaign launched. Float RM
RLD_AMT_AFTER Total of reload amount after campaign ended. Float RM
ARPU_BEFORE ARPU before campaign launched. Float RM
CPA_RVN_BEFORE Total added value service before campaign launched. Float RM
CPA_RVN_AFTER Total added value service after campaign ended. Float RM
ARPU_AFTER ARPU after campaign ended. Float RM
ACTIVITY_DAYS_AFTER  Silent days after campaign ended. Integer Days
ACTIVITY_STATUS_AFTER Customer activity status after campaign ended. String N/A

Table below shows the first 5 rows of the dataset collected.

# Load dataset
dataset <- read_excel("/Users/kharshin/Dataset.xlsx")
head(dataset, n = 5)
## # A tibble: 5 × 27
##   TENURE   AGE GENDER NATIONALITY STATE   STATUS_BEFORE STATUS_AFTER OFFER_TAKER
##    <dbl> <dbl> <chr>  <chr>       <chr>   <chr>         <chr>        <chr>      
## 1     90    28 Female Malaysia    JOHORE  Active        Active       Y          
## 2    204    81 Male   Malaysia    PAHANG  Active        Active       Y          
## 3    120    84 Male   Malaysia    KLANG … Active        Active       N          
## 4    199    82 Male   Malaysia    PAHANG  Active        Active       Y          
## 5     30 -9999 ?      Malaysia    SABAH   Active        Active       N          
## # ℹ 19 more variables: OFFER_TAKE_UP_DT <chr>, DATA_PURC_BEFORE <chr>,
## #   DATA_PURC_AFTER <chr>, DATA_CHRG_BEFORE <chr>, DATA_CHRG_AFTER <chr>,
## #   DATA_USG_BEFORE <dbl>, DATA_USG_AFTER <dbl>, VOICE_USG_BEFORE <dbl>,
## #   VOICE_USG_AFTER <dbl>, RLD_IND_BEFORE <chr>, RLD_IND_AFTER <chr>,
## #   RLD_AMT_BEFORE <chr>, RLD_AMT_AFTER <chr>, CPA_RVN_BEFORE <chr>,
## #   CPA_RVN_AFTER <chr>, ARPU_BEFORE <chr>, ARPU_AFTER <chr>,
## #   ACTVIITY_DAYS_AFTER <dbl>, ACTIVITY_STATUS_AFTER <chr>


This section displayed the information of:
i. Number of columns and rows

cat("Number of columns:", ncol(dataset), "\nNumber of rows:", nrow(dataset))
## Number of columns: 27 
## Number of rows: 7272

A total of 27 columns and 7,272 records.


  1. Structure
str(dataset)
## tibble [7,272 × 27] (S3: tbl_df/tbl/data.frame)
##  $ TENURE               : num [1:7272] 90 204 120 199 30 16 86 34 12 35 ...
##  $ AGE                  : num [1:7272] 28 81 84 82 -9999 ...
##  $ GENDER               : chr [1:7272] "Female" "Male" "Male" "Male" ...
##  $ NATIONALITY          : chr [1:7272] "Malaysia" "Malaysia" "Malaysia" "Malaysia" ...
##  $ STATE                : chr [1:7272] "JOHORE" "PAHANG" "KLANG VALLEY" "PAHANG" ...
##  $ STATUS_BEFORE        : chr [1:7272] "Active" "Active" "Active" "Active" ...
##  $ STATUS_AFTER         : chr [1:7272] "Active" "Active" "Active" "Active" ...
##  $ OFFER_TAKER          : chr [1:7272] "Y" "Y" "N" "Y" ...
##  $ OFFER_TAKE_UP_DT     : chr [1:7272] "44691" "44691" "?" "44722" ...
##  $ DATA_PURC_BEFORE     : chr [1:7272] "Y" "N" "N" "Y" ...
##  $ DATA_PURC_AFTER      : chr [1:7272] "N" "N" "N" "N" ...
##  $ DATA_CHRG_BEFORE     : chr [1:7272] "15" "0" "0" "15" ...
##  $ DATA_CHRG_AFTER      : chr [1:7272] "0" "0" "0" "0" ...
##  $ DATA_USG_BEFORE      : num [1:7272] 0 0 0 0 6.53 ...
##  $ DATA_USG_AFTER       : num [1:7272] 0 0 0 0 1.02 ...
##  $ VOICE_USG_BEFORE     : num [1:7272] 732.4 49.9 17.1 16.2 14.2 ...
##  $ VOICE_USG_AFTER      : num [1:7272] 1535.917 0.667 0 8.317 0 ...
##  $ RLD_IND_BEFORE       : chr [1:7272] "N" "N" "N" "N" ...
##  $ RLD_IND_AFTER        : chr [1:7272] "Y" "N" "N" "N" ...
##  $ RLD_AMT_BEFORE       : chr [1:7272] "?" "?" "?" "?" ...
##  $ RLD_AMT_AFTER        : chr [1:7272] "5" "?" "?" "?" ...
##  $ CPA_RVN_BEFORE       : chr [1:7272] "?" "?" "?" "?" ...
##  $ CPA_RVN_AFTER        : chr [1:7272] "?" "?" "?" "?" ...
##  $ ARPU_BEFORE          : chr [1:7272] "?" "13.65" "4.32" "4.68" ...
##  $ ARPU_AFTER           : chr [1:7272] "5.15" "0.26" "?" "2.73" ...
##  $ ACTVIITY_DAYS_AFTER  : num [1:7272] 0 1 5 5 0 9 0 33 11 1 ...
##  $ ACTIVITY_STATUS_AFTER: chr [1:7272] "DURING & AFTER CAMP" "DURING & AFTER CAMP" "DURING & AFTER CAMP" "DURING & AFTER CAMP" ...

There is a total of 7 numerical columns and 20 categorical columns. However, several numerical columns are misclassified as categorical because the blank or missing values are automatically detected as a string of ? in R. Columns affected are RLD_AMT_BEFORE, RLD_AMT_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, ARPU_BEFORE, ARPU_AFTER, DATA_PURC_BEFORE and DATA_PURC_AFTER. Meanwhile, OFFER_TAKE_UP_DT is misclassified as numerical but it is actually a date variable.


  1. Count of missing values within the columns
# Check missing values in categorical columns and print column names if there are null values
cat_missing <- sapply(dataset, function(x) {
  if (is.factor(x) && sum(is.na(x)) > 0) {
    cat("Column:", names(x), "has", sum(is.na(x)), "missing value(s).\n")
  }
  sum(is.na(x))
})

# Check missing values in numerical columns and print column names if there are null values
num_missing <- sapply(dataset, function(x) {
  if (is.numeric(x) && sum(is.na(x)) > 0) {
    cat("Column:", names(x), "has", sum(is.na(x)), "missing value(s).\n")
  }
  sum(is.na(x))
})

No missing values are found but based on the view in the dataset structure, there are ? missing values within the columns. Hence, data cleaning is required.


Part 3: Exploratory Data Analysis

This section displayed the graph distributions for all the columns of the dataset. Violin plots and bar charts are used.

# Create function to perform data profiling on a data frame
# Input: x - data frame
# Output: Prints plots for numeric columns and categorical columns
data_profiling = function(x) {
  
  # Iterate over column names
  for (col_name in colnames(x)) {
    
    # Check if column is numeric
    if (is.numeric(x[[col_name]])) {
      
      # Generate violin plot for numeric columns
      p = ggplot(x, aes(x = col_name, y = x[[col_name]], fill = col_name)) +
        geom_violin() +
        labs(x = "Column", y = "Range", title = paste("Violin Plot of", col_name)) +
        theme_minimal() 
      print(p)
      
    }
    
    # Column is not numeric (assumed categorical)
    else {
      # Calculate counts for each category
      counts = table(x[[col_name]])
      
      # Create a data frame for plotting
      df_counts = data.frame(category = names(counts), count = as.numeric(counts))
      
      # Generate bar plot for categorical columns
      p = ggplot(df_counts, aes(x = count, y = category)) +
        geom_col(fill = "steelblue") +
        labs(x = "Count", y = "Category", title = paste("Bar Plot of", col_name)) +
        theme_minimal() 
      print(p)
    }
  }
}

# Display graphs
data_profiling(dataset)

Based on the graphs,it can be observed that:

-AGE contains a minimum age of -9999, which seems to be abnormal as human age should between range of 0 to 100 only but not negative values. This phenomenon occurred because customers’ age information was missing due to system or human error thus replacing with -9999.

-ARPU_BEFORE,ARPU_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, DATA_CHRG_BEFORE,DATA_CHRG_AFTER, RLD_AMT_BEFORE, and RLD_AMT_AFTER are futher justified as numeric instead of categorical. Upon checking, these columns were filled up with ? instead of 0 values which wrongly tagged them as a class variable.

-STATE contained inconsistency data where Malaysia comprises 13 states and 3 federal territories only but the graphs showed 20 unique values.

-GENDER commonly only has 2 unique values of male and female but there are 4 different values.

-OFFER_TAKE_UP_DT consists of null values and date values.

AGE, GENDER, STATE, ARPU_BEFORE, ARPU_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, DATA_CHRG_BEFORE, DATA_CHRG_AFTER, RLD_AMT_BEFORE and RLD_AMT_AFTER and OFFER_TAKE_UP_DT need to undergo data cleaning and pre-processing to ensure the data is cleaned before further analysis and modelling.


Part 4: Data Cleaning

This section displayed data cleaning for all the four data quality issues (inconsistent data, intentional data, incomplete data and noisy data) found in the dataset as mentioned in Exploratory Data Analysis section.

  1. Modify intentional data - AGE
# Remove records of 'AGE' with value '-9999'
dataset <- subset(dataset, AGE != -9999)

Records with values of -9999 representing no record is stated for the customer age. Thus, these records are then removed as they did not bring any useful information.


  1. Modify incomplete data - Replace ? to NA then fill with 0
# replace "?" to "NA"
dataset <- replace(dataset, dataset=='?', NA)

# replace "NA" with 0
dataset$ARPU_BEFORE = replace(dataset$ARPU_BEFORE, is.na(dataset$ARPU_BEFORE), 0)
dataset$ARPU_AFTER = replace(dataset$ARPU_AFTER, is.na(dataset$ARPU_AFTER), 0)
dataset$CPA_RVN_BEFORE = replace(dataset$CPA_RVN_BEFORE, is.na(dataset$CPA_RVN_BEFORE), 0)
dataset$CPA_RVN_AFTER = replace(dataset$CPA_RVN_AFTER, is.na(dataset$CPA_RVN_AFTER), 0)
dataset$RLD_AMT_BEFORE = replace(dataset$RLD_AMT_BEFORE, is.na(dataset$RLD_AMT_BEFORE), 0)
dataset$RLD_AMT_AFTER = replace(dataset$RLD_AMT_AFTER, is.na(dataset$RLD_AMT_AFTER), 0)
dataset$DATA_CHRG_BEFORE = replace(dataset$DATA_CHRG_BEFORE, is.na(dataset$DATA_CHRG_BEFORE), 0)
dataset$DATA_CHRG_AFTER = replace(dataset$DATA_CHRG_AFTER, is.na(dataset$DATA_CHRG_AFTER), 0)

Columns which mistagged as categorical with values of ? is replaced with NA then 0, are now corrected to numerical data type. 0 is chosen to replace the unknown values due to the business logic in which null is shown when no mapping results.


  1. Modify inconsistent data - STATE
# Standardize the name for the states
dataset$STATE[dataset$STATE == "JOHORE"] = "JOHOR"
dataset$STATE[dataset$STATE == "KLANG VALLEY"] = "WILAYAH PERSEKUTUAN"
dataset$STATE[dataset$STATE == "MALACCA"] = "MELAKA"
dataset$STATE[dataset$STATE == "N SEMBILAN"] = "NEGERI SEMBILAN"
dataset$STATE[dataset$STATE == "PULAU PINANG"] = "PENANG"
dataset$STATE[dataset$STATE == "SEREMBAN/MELAKA"] = "MELAKA"

The inconsistencies in the naming of the states where the names are mixed with Malay and English that caused the duplicate values in the variable are standardzied using English naming convention.


  1. Modify inconsistent data - GENDER
dataset$GENDER = replace(dataset$GENDER, is.na(dataset$GENDER), "Unspecified")
dataset <- subset(dataset, GENDER != "Unspecified")

The values of Unspecified and ? (which then replaced with NA) are grouped together as they represented unknown gender.These group of people are then removed from the dataset and the GENDER variable consists only the value of Male and Female.


  1. Modify incomplete data - OFFER_TAKE_UP_DT

    OFFER_TAKE_UP_DT which was supposed to be in date format, will still remained as string data type as if the corresponding cell of the OFFER_TAKER attribute is No, it meant the customer did not take up the offer and naturally there would not be any offer take-up date recorded. This variable is then removed as dates do not bring any meaningful information in this study.


  1. Standardize the numeric values into 2 decimal place
# Format the numeric into 2 decimal points
dataset$ARPU_BEFORE = round(as.numeric(dataset$ARPU_BEFORE), 2)
dataset$ARPU_AFTER = round(as.numeric(dataset$ARPU_AFTER), 2)
dataset$CPA_RVN_BEFORE = round(as.numeric(dataset$CPA_RVN_BEFORE), 2)
dataset$CPA_RVN_AFTER = round(as.numeric(dataset$CPA_RVN_AFTER), 2)
dataset$RLD_AMT_BEFORE = round(as.numeric(dataset$RLD_AMT_BEFORE), 2)
dataset$RLD_AMT_AFTER = round(as.numeric(dataset$RLD_AMT_AFTER), 2)
dataset$DATA_CHRG_BEFORE = round(as.numeric(dataset$DATA_CHRG_BEFORE), 2)
dataset$DATA_CHRG_AFTER = round(as.numeric(dataset$DATA_CHRG_AFTER), 2)


  1. Modify noisy data - Outliers

    The variables associated with this data quality issue were ARPU_BEFORE, ARPU_AFTER, CPA_RVN_BEFORE, CPA_RVN_AFTER, DATA_CHRG_BEFORE, DATA_CHRG_AFTER, RLD_AMT_BEFORE, RLD_AMT_AFTER, DATA_USG_BEFORE, DATA_USG_AFTER, VOICE_USG_BEFORE, VOICE_USG_AFTER, AGE and TENURE. Based on domain knowledge, these extreme values carried informative insights.Outliers detected in these variables are not removed as the extreme values are legitimate observations that were part of the sample which currently studying, thus leaving it in the dataset.


  2. Remove columns which is meaningless and will not be used for the analysis from the study

# Drop the "meaningless" columns 
dataset = subset(dataset, select = -c(NATIONALITY, STATUS_BEFORE, OFFER_TAKE_UP_DT))

# Backup the cleaned dataset for analysis 
cleaned_dataset <- dataset

# Select and print out the "AFTER" columns
cols_to_drop = grep("AFTER$", names(dataset), value = TRUE)

# Drop the "AFTER" columns
dataset[, cols_to_drop] = NULL

As mentioned previously, the campaign is launched between a specific date and the group of customers involved are active Prepaid Malaysian customers. Hence, NATIONALITY, STATUS_BEFORE and OFFER_TAKE_UP_DT is not needed in the analysis. All the variables with naming of AFTER is removed as these variables are needed when there’s a need to study the performance of pre-campaign and post-campaign.


Part 5: Dataset Overview (After Data Cleaning)

This section displayed the data overview after data cleaning.

# Check missing values
colSums(is.na(dataset))
##           TENURE              AGE           GENDER            STATE 
##                0                0                0                0 
##      OFFER_TAKER DATA_PURC_BEFORE DATA_CHRG_BEFORE  DATA_USG_BEFORE 
##                0                0                0                0 
## VOICE_USG_BEFORE   RLD_IND_BEFORE   RLD_AMT_BEFORE   CPA_RVN_BEFORE 
##                0                0                0                0 
##      ARPU_BEFORE 
##                0
# Check data structure
str(dataset)
## tibble [7,157 × 13] (S3: tbl_df/tbl/data.frame)
##  $ TENURE          : num [1:7157] 90 204 120 199 16 86 34 12 35 56 ...
##  $ AGE             : num [1:7157] 28 81 84 82 82 82 85 85 88 86 ...
##  $ GENDER          : chr [1:7157] "Female" "Male" "Male" "Male" ...
##  $ STATE           : chr [1:7157] "JOHOR" "PAHANG" "WILAYAH PERSEKUTUAN" "PAHANG" ...
##  $ OFFER_TAKER     : chr [1:7157] "Y" "Y" "N" "Y" ...
##  $ DATA_PURC_BEFORE: chr [1:7157] "Y" "N" "N" "Y" ...
##  $ DATA_CHRG_BEFORE: num [1:7157] 15 0 0 15 15 0 0 0 0 0 ...
##  $ DATA_USG_BEFORE : num [1:7157] 0 0 0 0 0.0185 ...
##  $ VOICE_USG_BEFORE: num [1:7157] 732.4 49.9 17.1 16.2 13.6 ...
##  $ RLD_IND_BEFORE  : chr [1:7157] "N" "N" "N" "N" ...
##  $ RLD_AMT_BEFORE  : num [1:7157] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CPA_RVN_BEFORE  : num [1:7157] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ARPU_BEFORE     : num [1:7157] 0 13.65 4.32 4.68 4.4 ...
# Check first 5 rows of dataset
head(dataset, n = 5)
## # A tibble: 5 × 13
##   TENURE   AGE GENDER STATE        OFFER_TAKER DATA_PURC_BEFORE DATA_CHRG_BEFORE
##    <dbl> <dbl> <chr>  <chr>        <chr>       <chr>                       <dbl>
## 1     90    28 Female JOHOR        Y           Y                              15
## 2    204    81 Male   PAHANG       Y           N                               0
## 3    120    84 Male   WILAYAH PER… N           N                               0
## 4    199    82 Male   PAHANG       Y           Y                              15
## 5     16    82 Male   PERAK        Y           Y                              15
## # ℹ 6 more variables: DATA_USG_BEFORE <dbl>, VOICE_USG_BEFORE <dbl>,
## #   RLD_IND_BEFORE <chr>, RLD_AMT_BEFORE <dbl>, CPA_RVN_BEFORE <dbl>,
## #   ARPU_BEFORE <dbl>

There are a total of 7,157 records (after removing the intentional age and unknown gender) with 13 selected variables to continue our analysis such as TENURE, AGE, GENDER, STATE, DATA_PURC_BEFORE, DATA_CHRG_BEFORE, DATA_USG_BEFORE, VOICE_USG_BEFORE, RLD_IND_BEFORE, RLD_AMT_BEFORE, CPA_RVN_BEFORE, ARPU_BEFORE and OFFER_TAKER. No missing values are found and the type of variables are now correct.


Part 6: Data Pre-Processing and Transformation

This section explained how data is preprocessed before proceed to analysis and modelling. Categorical variables is transformed using one-hot encoding and heatmap is used to check the correlation between variables. Highly correlated variables will be dropped and dataset will be split into training (80%) and testing (20%) set.

  1. Categorical variables such as GENDER and STATE are converted to numeric values by one-hot encoding, whereas DATA_PURC_BEFORE and RLD_IND_BEFORE are converted to numeric values using remapping method.
# Replace column values of "Y" with 1 and "N" with 0 
dataset$DATA_PURC_BEFORE <- as.integer(ifelse(dataset$DATA_PURC_BEFORE == "Y", 1, ifelse(dataset$DATA_PURC_BEFORE == "N", 0, dataset$DATA_PURC_BEFORE)))
dataset$RLD_IND_BEFORE <- as.integer(ifelse(dataset$RLD_IND_BEFORE == "Y", 1, ifelse(dataset$RLD_IND_BEFORE == "N", 0, dataset$RLD_IND_BEFORE)))

# Apply one-hot encoding to the categorical variables
cols_to_encode = c("GENDER", "STATE")
one_hot = dummyVars(" ~ .", data = dataset[, cols_to_encode])
encoded_df = data.frame(predict(one_hot, newdata = dataset))

# Preview the first 5 rows
head(encoded_df, 5)
##   GENDERFemale GENDERMale STATEJOHOR STATEKEDAH STATEKELANTAN STATEMELAKA
## 1            1          0          1          0             0           0
## 2            0          1          0          0             0           0
## 3            0          1          0          0             0           0
## 4            0          1          0          0             0           0
## 5            0          1          0          0             0           0
##   STATENEGERI.SEMBILAN STATEPAHANG STATEPENANG STATEPERAK STATEPERLIS
## 1                    0           0           0          0           0
## 2                    0           1           0          0           0
## 3                    0           0           0          0           0
## 4                    0           1           0          0           0
## 5                    0           0           0          1           0
##   STATESABAH STATESARAWAK STATESELANGOR STATETERENGGANU
## 1          0            0             0               0
## 2          0            0             0               0
## 3          0            0             0               0
## 4          0            0             0               0
## 5          0            0             0               0
##   STATEWILAYAH.PERSEKUTUAN
## 1                        0
## 2                        0
## 3                        1
## 4                        0
## 5                        0

Although variables GENDER and STATE have undergone data transformation, they will not be included in modelling as demographic variables will create bias towards the prediction output. For instance, it is impossible for Zentel to launch a campaign only for male and specific state such as Sabah only. The campaign launched should targets every customer.


  1. Select numerical variables and plot heatmap to investigate the correlation between numerical variables
# Select numeric variables 
non_encoded_cols = setdiff(colnames(dataset), cols_to_encode)
non_encoded_list = dataset[, non_encoded_cols]
non_encoded_list = subset(non_encoded_list, select = -OFFER_TAKER)

# Create function to extract the upper triangle of a correlation matrix
# Input: cormat - correlation matrix
# Output: upper triangle of the correlation matrix with lower triangle set to NA
get_upper_tri = function(cormat){
  
  # Set lower triangle of the correlation matrix to NA
  cormat[lower.tri(cormat)] <- NA
  
  # Return the modified correlation matrix
  return(cormat)
}

# Plot heatmap 
cormat = round(cor(non_encoded_list),2)
upper_tri = get_upper_tri(cormat)
melted_cormat = melt(upper_tri, na.rm = TRUE)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))+
  coord_fixed()

Based on the correlation plot, it can be observed that the pairs of CPA_RVN_BEFORE, RLD_AMT_BEFORE and ARPU_BEFORE and DATA_CHRG_BEFORE and DATA_PURC_BEFORE are highly correlated with each other. Hence,CPA_RVN_BEFORE, RLD_AMT_BEFORE and DATA_CHRG_BEFORE are dropped as these two variables are contributing to the value of ARPU. When the values increase, ARPU increase. On top of that, based on domain knowledge, ARPU_BEFORE is refering to the average spending of a customer which brings meaningful information.


  1. Remove high correlated variables
# Remove high correlation columns 
final_df = subset(non_encoded_list, select=c(-CPA_RVN_BEFORE, -RLD_AMT_BEFORE,-DATA_CHRG_BEFORE))


  1. Assign target variable
# Assign target 
final_df$OFFER_TAKER = ifelse(dataset$OFFER_TAKER == "Y", 1, 0)


  1. Split the dataset into train and test set
# Set the random seed
set.seed(123)

# Split the data into train and test set 
train_idx = createDataPartition(final_df$OFFER_TAKER, p = 0.8, list = FALSE)
train_data = final_df[train_idx,]
test_data = final_df[-train_idx,]

# Check dimensions of train and test set 
cat(" Dimensions of train data:", dim(train_data), "\n", "Dimensions of test data:", dim(test_data), "\n")
##  Dimensions of train data: 5726 8 
##  Dimensions of test data: 1431 8

After performing the data pre-processing, data transformation, feature selection and train test split, a total of 7 independent variables will be used to train the model.


Results and Findings

This section displayed the results and discussion for all the 4 research objectives.

Objective 1: Effectiveness of “Right Planning” Pilot Campaign

Effectiveness of “Right Planning pilot campaign is evaluated by 2 aspects - the total count of offer takers among all customers and the total count of offer takers among all customers who stay active after the pilot campaign.

# Count the frequencies of "Y" and "N" values in the column
value_counts <- table(cleaned_dataset$OFFER_TAKER)
value <- value_counts["Y"]
# Calculate the percentage of offer takers only - "Y" 
percentage_y <- (value_counts["Y"] / sum(value_counts)) * 100

# Count the frequencies of customers who remain active after opted-in for the pilot campaign
offertaker <- subset(cleaned_dataset, OFFER_TAKER == "Y")
value_counts <- table(offertaker$ACTIVITY_STATUS_AFTER)
# Calculate the sum of frequencies for the desired categories
sum_active <- sum(value_counts[c("DURING & AFTER CAMP", "BEFORE & AFTER CAMP", "DURING CAMP THEN REMAIN")])
# Calculate the percentage of active offer takers
percentage_active <- sum_active / sum(value_counts) * 100

# Display results
message <- paste(" Number of offer takers:", value, "\n", "Percentage of offer takers:", round(percentage_y, 2), "%\n", "Number of customers who remain active after opted-in for the pilot campaign:", sum_active, "\n", "Percentage of customers who remain active after opted-in for the pilot campaign:", round(percentage_active, 2), "%\n")
cat(message)
##  Number of offer takers: 4499 
##  Percentage of offer takers: 62.86 %
##  Number of customers who remain active after opted-in for the pilot campaign: 3047 
##  Percentage of customers who remain active after opted-in for the pilot campaign: 67.73 %

The above statistics indicating the pilot campaign launched was moderately successful because:

-The pilot campaign launched has a total number of 4,499 customers out of 7,157 who opted in for the new rate plans. The opt in rate for the campaign is 63%.

-There are 3,047 customers out of 4,499 opted-in customers who remain active after the campaign. The active rate is 68%.


Objective 2: Offer Takers’ Profile

The profile of offer takers is identified by observing the graph distribution grouped by offer taker. A more precise analysis can be done by conducting pivot table in future.

data_profiling_after <- function(x, group_by_col) {
  
  # Iterate over column names
  for (col_name in colnames(x)) {
    
    # Check if column is numeric
    if (is.numeric(x[[col_name]])) {
      
      # Generate violin plot for numeric columns grouped by OFFER_TAKER
      p <- ggplot(x, aes(x = col_name, y = x[[col_name]], fill = col_name)) +
        geom_violin() +
        labs(x = "Column", y = "Range", title = paste("Violin Plot of", col_name)) +
        theme_minimal() +
        facet_wrap(~get(group_by_col))
      print(p)
      
    }
    
    # Column is not numeric (assumed categorical)
    else {
      # Calculate counts for each category grouped by OFFER_TAKER
      counts <- table(x[[col_name]], x[[group_by_col]])
      
      # Create a data frame for plotting
      df_counts <- data.frame(category = rownames(counts), count = as.numeric(counts), OFFER_TAKER = rep(colnames(counts), each = nrow(counts)))
      
      # Generate bar plot for categorical columns grouped by OFFER_TAKER
      p <- ggplot(df_counts, aes(x = count, y = category, fill = OFFER_TAKER)) +
        geom_col() +
        labs(x = "Count", y = "Category", title = paste("Bar Plot of", col_name)) +
        theme_minimal() +
        facet_wrap(~get(group_by_col))
      print(p)
    }
  }
}

# Display graphs grouped by OFFER_TAKER
data_profiling_after(cleaned_dataset, "OFFER_TAKER")

Based on EDA, the customers who tend to take up the offer have the following characteristics:

-Campaign offer takers showed a higher numbers of opt in rate among male compared to female.

-Campaign offer takers are mostly from age group between 22 and 36 years old with tenure more than 1 year.

-Most of the campaign takers are mainly from Klang Valley, then follow by Sabah and Sarawak.

-Most of the campaign takers have higher ARPU, higher CPA revenue, higher reload amount, higher voice and data usage and tends to purchase data plan.

-Majority of the customers remain active after the campaign but 11 takers terminated their lines.


Objective 3: Offer Takers’ Predictive Modelling

Three famous classification models are chosen to predict which customers will more likely to opt-in for the “Right Planning” campaign based on their usage and revenue behavior, which are Random Forest Classifier, K-Nearest Neighbor (KNN) and Support Vector machine (SVM). The best performed model will be chosen as the prediction model for this study.

Evaluation metric of accuracy will be used in evaluating the models’ performance as the most common metric used to evaluate the performance of a classification predictive model is accuracy. On top of that, the dataset provided is consider a balanced dataset thus accuracy is chosen.

  1. Random Forest Classifier
# Set the random seed
set.seed(123)

# Train the Random Forest Classifier
classifier_RF = randomForest(x = train_data[, -ncol(train_data)],
                             y = as.factor(train_data$OFFER_TAKER),
                             ntree = 500,
                             importance = TRUE)

# Make predictions on the test data
preds = predict(classifier_RF, newdata = test_data[, -ncol(test_data)])

# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_RF = table(preds, test_data$OFFER_TAKER)
accuracy_RF = sum(diag(conf_mat_RF)) / sum(conf_mat_RF)
# Display output of confusion matrix and accuracy 
outputRF <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_RF)), collapse = "\n"), "\n", "Accuracy of Random Forest Classifier:", round(accuracy_RF * 100, 2), "%")
cat(outputRF)
## Confusion Matrix (Prediction):      
## preds   0   1
##     0 462  90
##     1 113 766 
##  Accuracy of Random Forest Classifier: 85.81 %

Random Forest model achieved an accuracy of 85.51%.


  1. K-Nearest Neighbor (KNN)
# Set the random seed
set.seed(123)

# Train the KNN Classifier
classifier_KNN <- train(x = train_data[, -ncol(train_data)],
                        y = as.factor(train_data$OFFER_TAKER),
                        method = "knn",
                        trControl = trainControl(method = "none"),
                        preProcess = c("center", "scale"))

# Make predictions on the test data
preds_KNN <- predict(classifier_KNN, newdata = test_data[, -ncol(test_data)])

# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_KNN <- table(preds_KNN, test_data$OFFER_TAKER)
accuracy_KNN <- sum(diag(conf_mat_KNN)) / sum(conf_mat_KNN)
# Display output of confusion matrix and accuracy 
outputKNN <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_KNN)), collapse = "\n"), "\n", "Accuracy of K-Nearest Neighbour:", round(accuracy_KNN * 100, 2), "%")
cat(outputKNN)
## Confusion Matrix (Prediction):          
## preds_KNN   0   1
##         0 438 124
##         1 137 732 
##  Accuracy of K-Nearest Neighbour: 81.76 %

K-Nearest Neighbor model achieved an accuracy of 81.76%.


  1. Support Vector Machine (SVM)
# Set the random seed
set.seed(123)

# Train the SVM Classifier
classifier_SVM <- svm(x = train_data[, -ncol(train_data)],
                      y = as.factor(train_data$OFFER_TAKER),
                      kernel = "radial")

# Make predictions on the test data
preds_SVM <- predict(classifier_SVM, newdata = test_data[, -ncol(test_data)])

# Calculate the confusion matrix/ accuracy of the predictions
conf_mat_SVM <- table(preds_SVM, test_data$OFFER_TAKER)
accuracy_SVM <- sum(diag(conf_mat_SVM)) / sum(conf_mat_SVM)
# Display output of confusion matrix and accuracy 
outputSVM <- paste("Confusion Matrix (Prediction):", paste(capture.output(print(conf_mat_SVM)), collapse = "\n"), "\n", "Accuracy of Support Vector Machine", round(accuracy_SVM * 100, 2), "%")
cat(outputSVM)
## Confusion Matrix (Prediction):          
## preds_SVM   0   1
##         0 468 108
##         1 107 748 
##  Accuracy of Support Vector Machine 84.98 %

Support Vector Machine model achieved an accuracy of 84.98%.


Feature Importance for Random Forest Classifiers

# Feature importance
imp_RF <- importance(classifier_RF)
features <- paste("Random Forest Classifier Feature Importance:\n", paste(capture.output(print(imp_RF)), collapse = "\n"))
cat(features)
## Random Forest Classifier Feature Importance:
##                            0           1 MeanDecreaseAccuracy MeanDecreaseGini
## TENURE             8.820628  24.3849919             24.13703        206.39523
## AGE                5.278590   9.2401210             11.41115        149.28534
## DATA_PURC_BEFORE 104.606113 157.6559667            135.19721       1087.75961
## DATA_USG_BEFORE   32.503856 -11.7541345             18.30836        233.48710
## VOICE_USG_BEFORE  16.258360   7.1245668             20.95345        104.26227
## RLD_IND_BEFORE    22.069320   2.9338180             21.51774         51.02151
## ARPU_BEFORE       52.064695  -0.1964525             50.16256        253.43854

Throughout the performance for these classifiers, Random Forest Classifier has the highest accuracy hence it is selected as the best classification model / predictive model in this study. A higher value in Mean Decrease Accuracy indicates that the feature has a stronger influence on the accuracy of the model. The top 5 important features are DATA_PURC_BEFORE, ARPU_BEFORE, RLD_IND_BEFORE, TENURE and DATA_USG_BEFORE. These variables can be used to predict if a customer tends to opt in for the new offer.

A customer tends to take up the offer of the campaign if the customer has the following characteristics: A longer tenure, high average revenue, perform reload before the campaign, purchase any data plan before the campaign and high data usage.


Objective 4: Relationship between Customers’ Behavior (Usage) and Revenue Generated

This section displayed the relationship between the customers’ behavior in terms of usage and revenue generated before the campaign using multiple linear regression.

  1. Select usage features and revenue feature
# Select relevant feature
reg_final_df = subset(final_df, select = c(DATA_USG_BEFORE, ARPU_BEFORE, VOICE_USG_BEFORE))


  1. Split the dataset into train and test set
# Set the random seed
set.seed(123)

# Split the data into train and test set 
train_idx = createDataPartition(reg_final_df$ARPU_BEFORE, p = 0.8, list = FALSE)
reg_train_data = reg_final_df[train_idx,]
reg_test_data = reg_final_df[-train_idx,]

# Check dimensions of train and test set 
cat(" Dimensions of train data:", dim(reg_train_data), "\n", "Dimensions of test data:", dim(reg_test_data), "\n")
##  Dimensions of train data: 5726 3 
##  Dimensions of test data: 1431 3


  1. Perform scaling on the train and test set and assign target for regression model
# Set the random seed
set.seed(123)

# Scaling
train_data_scaled = reg_train_data %>%
  select(-ARPU_BEFORE) %>%
  scale() %>%
  as.data.frame() %>%
  cbind(reg_train_data$ARPU_BEFORE)

colnames(train_data_scaled)[3] = "target"


  1. Linear Regression
# Train the Regression Model
lm_model = lm(target ~ ., data = train_data_scaled)

# Make predictions on the test data 
test_data_scaled = reg_test_data %>%
  select(-ARPU_BEFORE) %>%
  scale() %>%
  as.data.frame() %>%
  cbind(reg_test_data$ARPU_BEFORE)
predictions = predict(lm_model, newdata = test_data_scaled)

colnames(test_data_scaled)[3] = "target"


Multiple Linear Regressin Model Performance

# Result
RMSE = sqrt(mean((predictions - test_data_scaled$target)^2))
R2 = cor(predictions, test_data_scaled$target)^2
## Performance of Linear Regression:
## 
## RMSE: 75.65 
## R-squared: 0.00095

RMSE value of 75.65 suggests that on average, the predictions of the regression model have an error of approximately 75.65 units. R-squared value of 1 indicates that the predictors explain all of the variance. However, based on the output, the R-squared value of 0.00095 suggests that the independent variables in the model explain a very small portion (approximately 0.095%) of the variance in the dependent variable.

In other words, the model has very little predictive power in explaining the variability of the data. This means that customers’ behavior in terms of usage is not the determinant of the revenue generated. This is indeed true because revenue generated is not just coming from usage but other factors such as purchasing roaming pass and added-value service. When Base Management Team try to design a new offer, the structure content cannot only consider data usage and voice usage will brings more revenue.


Conclusion

In conclusion, the research objectives of this study are achieved by using several statistical methods.

First of all, the effectiveness of the pilot campaign is evaluated where the statistics showed the pilot campaign lauched is moderately success with 63% take up rate and 68% active rate among the offer takers.

Second, the campaign offer taker’s profile is identified. Campaign offer takers are mostly from age group between 22 and 36 years old with tenure more than 1 year and gender male. Most of them are mainly from Klang Valley, then follow by Sabah and Sarawak. Majority of the offer takers have higher ARPU, higher CPA revenue, higher reload amount, higher voice and data usage and tends to purchase data plan.

Third, a classification model is developed to predict if the customers will opt-in for the “Right Planning” campaign based on usage and revenue behavior. Random Forest Classifier with accuracy of 85.51% is chosen as the predictive model because it has the higghest accuracy compared to the other 2 models. The top 5 important features are DATA_PURC_BEFORE, ARPU_BEFORE, RLD_IND_BEFORE, TENURE and DATA_USG_BEFORE.

Last but not least, the relationship between the customers’ behavior in terms of usage and revenue generated before the campaign using multiple linear regression is investigated. It is clearly proved that revenue generated is not directly affected by the data and voice usage. Other factors such as purchasing roaming pass and added-value service should be considered when Base Management Team would like to increase revenue.


Citation

Cote, C. (2021, January 21). What is Marketing Analytics? Harvard Business School Online. Retrieved May 20, 2023, from https://online.hbs.edu/blog/post/what-is-marketing-analytics